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Abstract 

Normalized compression distance (NCD) is a parameter-free similarity measure based on compres- 
sion. The NCD between pairs of objects is not sufficient for all applications. We propose an NCD of finite 
multisets (multiples) of objacts that is metric and is better for many applications. Previously, attempts to 
obtain such an NCD failed. We use the theoretical notion of Kolmogorov complexity that for practical 
purposes is approximated from above by the length of the compressed version of the file involved, using 
a real-world compression program. We applied the new NCD for multiples to retinal progenitor cell 
questions that were earlier treated with the pairwise NCD. Here we get significantly better results. We 
also applied the NCD for multiples to synthetic time sequence data. The preliminary results are as good 
as nearest neighbor Euclidean classifier. 

Index Terms — Normalized compression distance, multisets or multiples, pattern recognition, data 
mining, similarity, Kolmogorov complexity, retinal progenitor cell classification, synthetic data classifi- 
cation 

I. Introduction 

The classical notion of Kolmogorov complexity fT3l is an objective measure for the information in 
an a single object, and information distance measures the information between a pair of objects 0. 
This last notion has spawned research in the theoretical direction, for example ||2D . and in the practical 
direction through the normalized compression distance, the similarity metric, which arises by normalizing 
the information distance in a proper manner and approximating the Kolmogorov complexity through real- 
world compressors ||T61 . (@), This normalized compression distance is a parameter-free, feature-free, and 
alignment-free similarity measure that has found many applications in pattern recognition, phylogeny, 
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clustering, and classification, for example 0/|, El> H2L Oil. 123, 10, CD, 123 • Another application 
is to objects that are only represented by name, or objects that are abstract like 'red,' 'Einstein,' 'three.' 
In this case the similarity metric uses background information provided by Google or any search engine 
that produces aggregate page counts. It discovers the 'meaning' of words and phrases in the sense of 
producing a relative semantics J5]. 

However, in many applications we are interested in shared information between many objects instead 
of just a pair of objects. In customer reviews of gadgets, in blogs about public happenings, in newspaper 
articles about the same occurrence, we are interested in the most comprehensive one or the most 
specialized one. Thus, the information distance measure has been extended from pairs to finite multisets. 
For many applications we require a normalized and computable version. For instance, classifying an 
object into one or another of disjoint classes we aim for the class of which the NCD for multiples 
grows the least. We applied the new NCD for multiples to retinal progenitor cell questions that were 
earlier treated with the pairwise NCD. Here we get significantly better results. We also applied the NCD 
for multiples to synthetic time sequence data. The preliminary results are as good as nearest neighbor 
Euclidean classifier. 

A. Related Work 

In ifTTl the notion is introduced of the information required to go from any object in a multiset of 
objects to any other object in the multiset. This is applied to extracting the essence from, for example, 
a finite multiset of internet news items, reviews of electronic cameras, tv's, and so on, in a way that 
works better than other methods. Let X denote a finite multiset of m finite binary strings defined by 
(abusing the set notation) X = {x\, . . . ,x m }, the constituting elements (not necessarily all different) 
ordered length-increasing lexicographic. We use multisets and not sets, since if X is a set then all of its 
members are different while we are interested in the situation were some of the objects are equal. Let 
U be the reference universal Turing machine, for convenience the prefix one as in Section JI] We define 
the information distance in X by E mstK (X) = min{|p| : U(xi,p,j) = Xj for all Xi,Xj € X}. It is shown 
in IfTTl . Theorem 2, that 

E max (X) = max K(X\x), (1.1) 
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up to a logarithmic additive term. Define E m \ n (X) = mm x:xe x K(X\x). Theorem 3 in ifTTl states that 

E m m(X) < E max (X) < min } E ma _ x (xi,x k ), (1.2) 

i:l<i<m L — ' 

Xi,x k £X & ky^i 

up to a logarithmic additive term. The paper [17] develops the stated results and applications. The infor- 
mation distance in |2] between strings x\ and X2 is denoted E max (xi, X2) = max.{K(xi\x2), K(x2\x{)}. 
Here we use the notation B miH (l) = maxj^gx K{X\x). The two coincide for \X\ = 2 since 
K(x,y\x) = K(y\x) up to an additive constant term. In |[24l the following results were obtained for 
multisets. The maximal overlap of information, concerning the remarkable property that the information 
needed to go from any member Xj to any other member x^ in a multiset X can be divided in two parts: 
a single string of length min* K (X\xi) and a special string of length maxj(if (X \xi) — minj K(X\xi) 
possibly depending on j and some logarithmic additive terms possibly depending on j, k. Furthermore, 
the minimal overlap property, the metricity property, the universality property, and the not-subadditivity 
property. With respect to normalization of the information distance of multisets abortive attempts were 
given. A review of some of the above is lTT8l . 

II. Preliminaries 

A. Kolmogorov Complexity 

The Kolmogorov complexity is the information in a single object lfl"3l . Informally, the Kolmogorov 
complexity of a finite binary string is the length of the shortest string from which the original can be 
losslessly reconstructed by an effective general-purpose computer such as a particular universal Turing 
machine. Hence it constitutes a lower bound on how far a lossless compression program can compress. 
For technical reasons we choose Turing machines with a separate read-only input tape that is scanned 
from left to right without backing up, a separate work tape on which the computation takes place, and 
a separate output tape. All tapes are divided into squares and are sem-infinite. Initially the input tape 
contains a semi-infinite binary string with one bit per square starting at the leftmost square. Upon halting, 
the initial segment p of the input that has been scanned is called the input "program" and the contents 
of the output tape is called the "output." By construction, the set of halting programs is prefix free. We 
call U the reference universal prefix Turing machine. This leads to the definition of "prefix Kolmogorov 
complexity" which we shall designate simply as "Kolmogorov complexity." 

Formally, the conditional Kolmogorov complexity K{x\y) is the length of the shortest input z such that 
the reference universal prefix Turing machine U on input z with auxiliary information y outputs x. The 
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unconditional Kolmogorov complexity K(x) is defined by K(x\e) where e is the empty string (of length 
0). In these definitions both x and y can consist of strings into which nonempty finite multisets of finite 
binary strings are encoded. 

Theory and applications are given in the textbook |[T9l . Here we give some relations that are needed 
in the paper. The information about x contained in y is defined as I(y : x) = K(x) — K(x\y). A deep, 
and very useful, result holding for both plain complexity and prefix complexity, due to L.A. Levin and 

A. N. Kolmogorov [29] called symmetry of information states that 

K(x, y) = K{x) + K(y\x) = K{y) + K(x\y), (HI) 

with the equalities holding up to a O(logif) additive term. Here, K = max{K(x), K(y)}. Hence, up to 
an additive logarithmic term I(x : y) = I(y : x) and we call this the mutual (algorithmic) information 
between x and y. 

B. Multiset 

A multiset is also known as bag, list, or multiple. A multiset is is a generalization of the notion of 
set. The members are allowed to appear more than once. For example, if x ^ y then {x, y} is a set, but 
{x, x, y} and {x, x, x, y, y} are multisets. We abuse the common set-membership notation by using it for 
multisets by writing x € {x, x, y} and z {x, x, y} for z ^ x,y. Further, {x, x, y} \ {x} = {x, y}. If 
X, Y, Z axe multisets and X, Z are nonempty and X = YZ, then we write Y C X. For us, a multiset is 
finite such as {x\, . . . ,x m } with m < oo and the members are finite binary strings in length-increasing 
lexicographic order. If X is a multiset, then some or all of its elements may be equal. Thus, X{ G X 
means that "x{ is an element of multiset X." With {x±, . . . , x m+ i} \ {x} we mean the length-increasing 
lexicographic concatenation of x\ . . . x m+ i with one occurrence of x removed. 

The finite binary strings, finiteness, and length-increasing lexicographic order allows us to assign a 
unique Kolmogorov complexity to a multiset. The conditional prefix Kolmogorov complexity K(X\x) 
of a multiset X given an element x is the length of a shortest program p for the reference universal 
Turing machine that with input x outputs the multiset X. The prefix Kolmogorov complexity K(X) of 
a multiset X is defined by K(X |e). One can also put multisets in the conditional such as K(x\X) or 
K(X\Y). We will use the straightforward laws K(-\X,x) = K{-\X) and K(X\x) = K(X'\x) up to an 
additive constant term, for x € X and X' equals the multiset X with one occurrence of the element x 
deleted. 
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C. Information Distance 

The information distance in a multiset X (\X\ > 2) is given by (II. lb . To obtain the pairwise information 
distance in O we take X = {x\,X2) in (II. lb . The resulting formula is equivalent to E max {x\, x-i) = 
max{fT(xi|x2), K(x2\x\)} up to a logarithmic additive term. 



Let X be the set of length-increasing lexicographic ordered finite multisets of finite binary strings. A 
distance function d on X is defined by d : X — > 1Z + where 1Z + is the set of nonnegative real numbers. 
Define Z = XY if Z is a multiset of the elements of the multisets X and Y and the elements of Z are 
ordered length-increasing lexicographic. A distance function d is a metric if 

1) Positive definiteness: d(X) = if all elements of X are equal and d(X) > otherwise. 

2) Symmetry: d(X) is invariant under all permutations of X. 

3) Triangle inequality: d(XY) < d(XZ) + d(ZY). 
We recall Theorem 4.1 and Claim 4.2 from [24- 1 . 

Theorem 2.1: The information distance for multisets -Emax is a metric where the (in)equalities hold 
up to a O (log IT) additive term. Here K is the largest quantity involved in each metric (in)equality 1) 
to 3), respectively. 

Claim 2.2: Let X,Y, Z be three multisets of finite binary strings and K = K(X) + K(Y) + K(Z). 
Then, E max (XY) < E max (X Z) + E max (ZY) up to an O(logif) additive term. 

III. Normalized Information Distance 

The quantitative difference in a certain feature between many objects can be considered as an admissible 
distance, provided it is upper semicomputable and satisfies a density condition for every x € {0, 1}* (to 
exclude distances like D{X) = 1/2 for every multiset X): 



Thus, for the density condition on D we consider only multisets X with \X\ > 2 and not all elements of 
X are equal. Moreover, we consider only distances that are semicomputable, that is, they are computable 
in some broad sense (they can be computably approximated from above). Theorem 5.2 in E4l shows 
that E m3iX is universal in that among all admissible multiset distances in that it is always least up to 



D. Metricity 





(III.l) 



x-.xex & D(x)>o 
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an additive constant. That is, it accounts for the dominant feature in which the elements of the given 
multiset are alike. 

Admissible distances as defined above are absolute, but if we want to express similarity, then we are 
more interested in relative ones. For example, if a multiset X of strings of each about 1,000,000 bits have 
pairwise information distance 1,000 bits to each other, then we are inclined to think that those strings are 
relatively similar. But if a multiset Y consists of strings of each about 1,200 bits and each two strings in 
it have a pairwise information distance of 1,000 bits, then we think the strings in Y are very different. 
In the first case E meLX (X) « 1,000|X[ + O(l), and in the second case E max (Y) « 1,000|Y| + 0(1). In 
case |X| :=s \Y\ the information distances of the multsets are about the same. 

Therefore, to express similarity we need to normalize the universal information distance E m3iX to obtain 
a universal similarity distance. It should give a similarity with distance when the objects in a multiset are 
maximally similar (that is, they are equal) and distance 1 when they are maximally dissimilar. Naturally, 
we desire the normalized version of the universal multiset information distance metric to be also a metric. 

For pairs of objects x, y the normalized version e of £? max defined by 

/ \ = ffmax(ff, y) = mayi{K(x,y\x),K(x,y\y} 
e[X,V> max{K(x),K(y)} max{K (x) , K (y)} { " ' 

takes values in [0,1] up to an additive term of 0(1/ K(x,y)). It is a metric up to additive terms 
0((log K )/K ) with K denotes the maximum of the Kolmogorov complexities involved in each of the 
metric (in)equalities, respectively. A normalization factor for multisets of more than two elements ought 
to reduce to that of (IIII.2I ) for multisets restricted to two elements. Consider strings consisting of the 
concatenation of finite multisets of finite strings. 

Remark 3.1: For example set X = {x},Y = {y,y},Z = {y},K(x) = n,K(x\y) = n,K(y) = 
0.9n and by using ( |II.1| > we have K(x,y) = 1.9n, K(y\x) = 0.9ra. The most natural definition is a 
generalization of (IIII.2I ): 

e(A) - EmUA) 



max x( z A {K (A \{x})}' 

But we find e(XY) = K(x\y) / K(x,y) = n/1.9n w 1/2, and e(XZ) = K(x\y)/K(x) = n/n = 1, 
e(ZY) = K(y\y) / K(y) = 0/0. 9n = 0, and the triangle inequality is violated. Intuitively, if we add 
an element to a multiset of objects then a program to go from any object in the new multiset to any 
other object should be at least as long as a program to go from any object in the old multiset to any 
other obkect. This suggests a definition of e(A) that is nondecreasing when we add elements to A. 
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This leads to (UTOl) . With B C A with B = A \ {y} we find e{XY) = K{x\y)/K{x) = n/n = 1, 
e(XZ) = K{x\y) / K{x) = n/n = 1 and e(ZY) = 0/n = 0. The triangle inequality holds. 
The reasoning in the remark points the way to go: the definition of e(X) with \X\ > 2 should be 
monotonic nondecreasing in \X\ if we want e to be a metric. 

Lemma 3.2: Let U, X be multisets and d be a distance that satisfies the triangle inequality. If U C X 
then d(U) < d(X). 

Proof: Let A, B, C be multisets with A,BCC, and d a distance that satisfies the triangle inequality. 
Assume that the lemma is false and d(C) < d(AB). Let D = C\A. It follows from the triangle inequality 
that 

d(AB) < d(AD) + d(DB). 

Since AD = C this implies d(AB) < d(C) + d(DB), and therefore d(C) > d{AB). But this contradicts 
the assumption. ■ 
Definition 3.3: Let X be a multiset. Define the normalized information distance (NID) for multiples 
by e(X) = for \X\ = 0,1 and 

e(X) = max! ^ffi ? i vi ■ max{e(y)}| (III.3) 

\max xeX {K(X\{x})} Ycx 1 n ) 

For |X| = 2 the value of e(X) is equivalently given in (IIII.2I) . 

Thus, (IIII.3b satisfies the property in Lemma [3^2] If X, Z are multisets and ZCI then e(Z) < e(X). 
Therefore we can hope to prove the triangle property for (IIIL3I ). 

Theorem 3.4: For every multiset X we have < e(X) < 1 up to an additive term of 0(1/ K) where 
K = K(X). 

Proof: By induction on n = \X\. The theorem is vacuously true for n = 0,1. 

Base case: n = 2. The definition of e(X) is (IIII.2I) . The proof of the lemma for this case is in ifToll . 

Induction n > 2: Assume that the lemma is true for the cases 2 < \X\ < n. Let \X\ = n. If 
e(X) = maxy c x{e(y)} then the lemma holds by the inductive assumption since |K| < n. So assume 
that 

,ys max x&x {K(X\x)} 
6[ ' ~ m^ xeX {K(X\{x})}- 

Since the numerator is at most the denominator up to an 0(1) additive term and the denominator at most 
K(X) the lemma holds for this case. ■ 
Remark 3.5: The least value of e(X) is reached if all occurrences of alements of X are equal. 
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In that case < e(X) < 0(l)/K(X). The greatest value e(X) = 1 + 0{l/K(X)) is reached 
if max xeX {K(X\x)} = ma,x xeX {K(X \ {x}\x)} + O(l) = max xeX {K(X \ {x}) + 0(1). For 
example, this happens if the selected conditional, say y, has no consequence in the sense that 
K(X \ {y}\y) = K(X \ {y}) + 0(1). This happens if K(z\y) = K(z) for all z E X \ {y}. 

Another matter is the consequences of Theorem 13.41 Using (III. lb in the first equality in both the 
numerator and the denominator. Then we obtain up to additive constants in the numerator and denominator 

max xeX {K(X\x)} = K (X) - mm xeX {K (x)} 
max xeX {Kpf \ {x})} K[X) - min x&x {K{x\X \ {x})} 

_ mm x€X {K(x)} - mm x( z X {K(x\X \ {x})} 
K(X)-mm x&x {K(x\X\{x})} 

This expression goes to 1 if both 

K(X)^oc, g^^Ml ^Q. 

K{X) 

This happens, for instance, if X — > {0, l} n and n — > oo. Another example is |X| = n, mm xeX = 
0, K(X) > n 2 , and n — > oo. One can only have K(X) — > oo and mm x( z X {K(x)} / 'K(X) — > 
if m.in x&x {K(x)} = o(K(X)) and max xeX {K (x)} = U(K(X)), that is, if X consists of at least 
two elements and gap between the minimum Kolmogorov complexity and the maximum Kolmogorov 
complexity of the elements grows to infinity when K(X) — > oo. 

Definition 3.6: Let x u and xy be defined such that K(U \ {x u }) = max x& u{K(U \ {x})}, and 
K{V\x v ) = max x£V {K(V\x)}. 

Theorem 3.7: For X is a multiset. The function e(X) is a metric up to an additive 0((logK)/K) 
term in the respective metric (in)equalities, where K is the largest Kolmogorov complexity involved the 
(in)equality. 

Proof: The quantity e{X) satisfies positive definiteness and symmetry up to an 
0((log K{X)) / K{X)) additive term, as follows directly from the definition of e(X). It remains 
to prove the triangle inequality: 

Let X, Y, Z be finite multisets. Then, e(XY) < e(XZ) + e(ZY) within an additive term of 
0{{\ogK)/K) where K = max{K(X),K(Y),K(Z)}. 

The proof proceeds by induction on n = \XY\. The cases n = 0, 1 are vacuously true. 

Base case n = 2. The definition of e(XY) with XY is a multiset of cardinality 2 is (IIII.2I ). The proof 
of the lemma for this case is in |[T6l . 



8 



Induction n > 2. Assume that the lemma is true for the cases 2 < \XY\ < n. Let \XY\ = n. 
If e(XY) = maxzcXY{e(Z)} then the lemma holds by the inductive assumption since \Z\ < n. So 
assume that 

e(XY)- *<«W) 



K(XY\{x xy }Y 



Claim 3.8: Let X,Y,Z be nonempty multisets. K{XYZ\x X yz) < K(XZ\x xz ) + K(ZY\x ZY ) up 
to an additive O(logK) term, where K = K(X) + K{Y) + K(Z). 

Proof: By Theorem 12. 11 we have that E max is a metric. In particular, the triangle inequality is satisfied 
by Claim 1251 K(XY\x X y) < K(XZ\x xz ) + K(ZY\x Z y) for multisets X,Y, Z up to an additive term 
of OQogK) where K = K{X) + K{Y) + K(Z). Thus with X' = XZ and Y' = ZY we have 
K(X'Y'\xx>y>) < K(X'Z\xx'z) + K(ZY'\x Z y>) up to the logarithmic additive term. Writing this out 
K{XZZY\x XZZ y) < K{XZZ\x XZZ )+K{ZYZ\xzyz) or K{XYZ\x XYZ ) = K(XZZY\x XZ zy) < 
K(XZ\x xz ) + K(ZY\x Z y) up to an additive term of 0(\ogK). ■ 
Now consider the following sequence of inequalities: 

K(XYZ\x XYZ ) K(XZ\x xz ) K(ZY\x ZY ) 

K(XYZ \ {x xyz }) ~ K(XYZ \ {x xyz }) + K(XYZ \ {x xyz }) 

K(XZ\x xz ) K{ZY\xzy) 
~ K(XZ \ {x xz }) + K{ZY \ {x zy }) ' 

up to a 0((log K)/K) additive term. The first inequality is Claim 13781 (by this inequality the denominator 
is unchanged); the second inequality follows from K(XY Z\{x xyz }) > K(XZ\{x xz }) and K(XYZ\ 
{ x xyz}) ^ K(ZY\{x zy }) using the principle that K(u,v) > K(u) + 0(1), reducing both denominators 
increases the sum of the quotients (by this inequality the numerators are unchanged). 

Claim 3.9: Let A be a multiset. If e(A) < 1 then there is an B with Ac B such that e(A) < e(B). 
Proof: Let e = 1 - e{A). Define B = A[j{b} such that e(B) > 1 - e/2. Namely, with K (b) > \b\ 
and the length of b growing, we have by (IIII.3b that e(B) > e(A). In particular, we can choose b such 
that e(B) > e(A) + e/2. ■ 
Assume e(XY) = 1. Then e(XYZ) = 1 since e(A) < 1 for every multiset A and by (IIII.3b we have 
e(B) > e(A) for A C B. Assume e(XY) < 1. By Claim EH we have e(XY) < e(XYU) for some 
nonempty multiset U. Hence by the definition (IIII.3b there is a nonempty multiset Z with Z C U such 



9 



that 

K(XY\x XY ) K(XYZ\x XYZ ) 
K(XY \ {x xy }) ~ K(XYZ \ {x xyz }) 

Together with (IIII.4I ) this proves the triangle inequality up to an additive term of 0((log K)/K). ■ 
By Theorems 13.41 and 13.71 the distance according to (IIII.3I ) is a metric with values in [0, 1] up to some 
ignorable additive terms.. 

IV. Compression Distance for Multisets 

We develop the compression-based equivalence of the Kolmogorov complexity based theory in the 
preceding sections. This is similar to [4| for the case \X\ = 2. We assume the notion of the real-world 
compressor G used in the sequel is "normal" in the sense of [4]. By G(x) we mean the length of string 
x when compressed by G. Consider a multiset A" as a string consisting of the concatenated strings of its 
members ordered length-increasing lexicographic. Thus we can write G(X). 

Let X = {x±, . . . ,x m }. The information distance E m&yi {X) can be rewritten as 

max{K(X) - K( Xl ), . . . , K{X) - K(x m )}, (IV.l) 

within logarithmic additive precision, by (III. It . The term K(X) represents the length of the shortest 
program for X. The order of the members of X makes only a small difference; block-coding based 
compressors are symmetric almost by definition, and experiments with various stream-based compressors 
(gzip, PPMZ) show only small deviations from symmetry. 

Approximation of £' max (A) by a compressor G is straightforward: it is 

£G,max(A) = max{G(X) - G(zi), • • • , G(X) - G{x m )} = G(X) - mm{G(x)}. (IV.2) 

We need to show it is an admissible distance and a metric. 

Lemma 4.1: If G is a normal compressor, then i?G max (A) is an admissible distance. 

Proof: For -E'G,max(A) to be an admissible distance it must satisfy the density requirement QUI. ID 
and be upper semicomputable. Since the length G(x) is computable it is a fortiori upper semicomputable. 
The density requirement (1HI.1D is equivalent to the Kraft inequality |[T4l and states in fact that for every 
string x the set of £ , G ]max (X) is a prefix-free code for the A's containing x. According to (IIV.2D we have 
for any fixed x G X: E G , max (X) > G(X) - G(x) > G(X \ {x}). Hence, 2- Eg ^ W < 2- G(x \^ 



10 



and therefore 



-E G , 



(x) < 2 -G(x\{x})_ 



X:x&X 



X:x&X 



A compressor G compresses strings into a uniquely decodable code (it must satisfy the unique 
decompression property) and therefore the length set of the compressed strings must satisfy the Kraft 
inequality |[20l . Then, for a fixed given x the compressed code for the multisets X \ {x} must satisfy 



Lemma 4.2: If G is a normal compressor, then i?G,max(^)) is a metric with the metric (in)equalities 
satisfied up to logarithmic additive precision. 

Proof: Let X, Y, Z be multisets with at most m members of length at most n. The positive 
definiteness and the symmetry property hold clearly up to an 0(logG(X))) additive term. Only the 
triangular inequality is nonobvious. For every compressor G we have G(XY) < G(X) + G(Y) up 
to an additive 0(log(G(X) + G{Y))) term, otherwise we obtain a better compression by dividing the 
string to be compressed. (This also follows from the distributivity property of normal compressors.) By 
the monotonicity property G(X) < G(XZ) and G{Y) < G(YZ) up to an 0(log(G(X) + G(Y))) or 
0(log(G(y) + G{Z))) additive term, respectively. Therefore, G{XY) < G(XZ) + G(ZY) up to an 



Let X be a multiset. The normalized version of e(X) using the compressor G based approximation of 
the normalized information distance for multisets (HII.3K is called the normalized compression distance 
(NCD) for multisetsmultiples: NCD(X) = for \X\ = 0, 1; if \X\ > 2 then 



Here X \ {x} denotes the string consisting of the length-increasing lexicographic elements of X with 
one occurrence of the substring x removed. 

This NCD is the main concept of this work. It is the real-world version of the ideal notion of 
normalized information distance NID for multiples in (IIII.3I ). 

Remark 5.1: In practice, the NCD is a non-negative number < r < 1 + e representing how different 
the two files are. Smaller numbers represent more similar files. The e in the upper bound is due to 
imperfections in our compression techniques, but for most standard compression algorithms one is unlikely 



this inequality. Hence the right-hand side of above displayed inequality is at most 1. 



0(log(G(X) + G{Y) + G{Z))) additive term. 



V. Normalized Compression Distance for Multisets 




(V.l) 
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to see an e above 0.1 (in our experiments gzip and bzip2 achieved NCD's above 1, but PPMZ always 
had NCD at most 1). 
Theorem 5.2: If the compressor is normal, then the NCD for multiples is a normalized admissible 
distance and satisfies the metric (in)equalities up to an ignorable additive term, that is, a similarity 
metric. 

Proof: The NCD (IV. II) is a normalized admissible distance by Lemma |4~T1 It is normalized to [0, 1] 
up to an additive term of 0(1/6/) with G = G(X) as we can see from the formula (IV. It and Theorem 13.41 
with G substituted for K throughout. We next show it is a metric. 

A normal compressor is idempotent in the sense that NCD(X) = if X consists of equal members. 
The idempotency property of a normal compressor is up to an additive term of 0(log G(X)). Hence the 
positive defmiteness of G(X) is satisfied up to an additive term of 0((logG(X))/G(X)). The order 
of the members of X is assumed to be length-increasing lexicographic. Therefore it is symmetric up 
to an additive term of 0((log G(X))/G(X)). It remains to show the triangle inequality NCD(XY) < 
NCD(XZ) + NCD(ZY) up to an additive term of 0((log G) /G) where G = G(X) + G(Y) + G(Z). 
We do this by induction on n = \XY\ where X,Y are multisets. For n = 0, 1 the triangle property is 
vacuously satisfied. 

Base case n = 2. That is, \XY\ = 2. This is proved in EJ. 

Induction n > 2. Assume the triangle property is satisfied for 2 < \XY\ < n. Then we prove it for 
\XY\ = n. If NCD(XY) = NCD(Z) for some Z C XY then 2 < \Z\ < n and the case follows 
from the inductive argument. Therefore, NCD(XY) is the first term in the outer maximization of (IV 1) . 
Write G(XY\x X y) = G(XY) - mm xeXY {G(x)} and G(XY \ {x xy }) = ui^ XY {G{XY) \ {x}} 
and similar for XZ, YZ, XY Z. Following the proof of the induction case of the triangle inequality in 
the proof of Theorem 13.71 using Lemma 14.21 for the metricity of Ea ; raax wherever Theorem 12.11 is used 
to assert the metricity of E max , and substitute G for K in the remainder. This completes the proof. That 
is, for every multiset Z we have 

NCD(XY) < NCD(XZ) + NCD(ZY), 

up to an additive term of 0((log G)/G). ■ 
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VI. Computing the Normalized Compression Distance for Multsets 

In practice it seems that one can do no better than follow the definition inductively. Assume we want 
to compute NCD(X) and \X\ > 2. 

Base Step: Compute M 2 = m a x Ycx, \y\=2{NCD(Y)}, as m 0- 

Induction: 2 < m < n. Let M m = max z {e(Z) : Z C X, 2 < \Z\ < m}). Then with Z C Y (1 X 
and \Y\ = m + 1: 



We ignore logarithmic additive terms. With \Y\ = n we have Y = X and hence NCDiY) = NCD(X). 
However, this process involves evaluating the NCD's of the almost the entire powerset of X. 

Natural Data and Kolmogorov Complexity: The Kolmogorov complexity of a file is a lower bound 
on the length of the ultimate compressed version of that file. In both cases above we approximate the 
Kolmogorov complexities involved by a real-world compressor. Since the Kolmogorov complexity is 
incomputable, in the approximation we never know how close we are to it. However, we assume that the 
natural data we are dealing with contain no complicated mathematical constructs like ir = 3.1415 ... or 
Universal Turing machines. In fact, we assume that the natural data we are dealing with contains only 
effective regularities that a good compressor finds. Under those assumptions the Kolmogorov complexity 
of the object is not much smaller than the length of the compressed version of the object. 



We detail preliminary results using the new NCD for multiples. The NCD for pairs as originally defined 
|4l has been applied in a wide range of application domains. In @| a close relative was compared to every 
time series distance measure published in the decade preceeding 2004 from all of the major data analysis 
conferencea and found to outperform all other distances aside from the Euclidean distance with which it 
was competitive. The NCD for pairs has also been applied in biological applications to analyze the results 
of segmentation and tracking of proliferating cells and organelles Q, 126*1 . Here, we compare the 
performance of the proposed NCD for multiples to that of a previous application of the NCD for pairs 
for predicting retinal progenitor cell (RPC) fate outcomes from the segmentation and tracking results 
from live cell imaging. We also apply the proposed NCD to a synthetic time sequence data set iflQl . 




VII. Applications 
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A. Retinal Progenitor Cell Fate Prediction 

In Q, long-term time-lapse image sequences showing rat RPCs were analyzed using automated 
segmentation and tracking algorithms. Images were captured every five minutes of the RPCs for a period 
of 9-13 days. Up to 100 image sequences may be captured simultaneously in this manner using a 
microscope with a mechanized stage. At the conclusion of the experiment, the "fate" of the offspring 
produced by each RPC was determined using a combination of cell morphology and specific cell-type 
fluorescent markers for the four different retinal cell types produced from embryonic day 20 rat RPCs |f3]. 
At the conclusion of the imaging, automated segmentation and tracking algorithms ||25| were applied to 
extract the time course of features for each cell. These automated segmentation and tracking algorithms 
extract a time course of feature data for each stem cell at a five-minute temporal resolution, showing the 
patterns of cellular motion and morphology over the lifetime of the cell. Specifically, the segmentation and 
tracking results consisted of a 6-dimensional time sequence feature vector incorporating two-dimensional 
motion (d x,d y), as well as the direction of motion, total distance travelled, cellular size or area (in 
pixels) and a measure of eccentricity on [0, 1] (0 being linear, 1 being circular shape). The time sequence 
feature vectors for each of the cells are of different length and are not aligned. The results from the 
segmentation and tracking algorithms were then analyzed as follows. 

The original analysis of the RPC segmentation and tracking results used a multiresolution semi- 
supervised spectral analysis based on the originally formulated pairwise NCD. An ensemble of distance 
matrices consisting of pairwise NCDs between quantized time sequence feature vectors of individual 
cells is generated for different feature subsets / and different numbers of quantization symbols n for 
the numerical time sequence data. The fully automatic quantization of the numeric time sequence data 
is described in JfJ). All subsets of the 6-dimensional feature vector were included, although it is possible 
to use non-exhaustive feature subset selection methods such as forward floating search, as described in 
floll . Each distance matrix is then normalized as described in |7], and the eigenvectors and eigenvalues 
of the normalized matrix are computed. These eigenvectors are stacked and ordered by the magnitude 
of the corresponding eigenvalues to form the columns of a new "spectral" matrix. The spectral matrix 
is a square matrix, of the same dimension N as the number of stem cells being analyzed. The spectral 
matrix has the important property that the zth row of the matrix is a point in 1Z N {1Z is the set of real 
numbers) that corresponds to the quantized feature vectors for the ith stem cell. If we consider only the 
first k columns, giving a spectral matrix of dimension N x k, and run a K-Means clustering algorithm, 
this yields the well-known spectral K-Means algorithm |8]. If we have known outcomes for any of the 
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objects that were compared using the pairwise NCD, then we can formulate a semi-supervised spectral 
learning algorithm by running for example nearest neighbors or decision tree classifiers on the rows of 
the spectral matrix. This was the approach adopted in |0. 

In the original analysis, three different sets of known outcomes were considered. First, a group of 72 
cells were analyzed to identify cells that would self-renew (19 cells), producing additional progenitors 
and cells that would terminally differentiate (53 cells), producing two retinal neurons. Next, a group 
of 86 cells were considered on the question of whether they would produce two photoreceptor neurons 
after division (52 cells), or whether they would produce some other combination of retinal neurons (34 
cells). Finally, 78 cells were analyzed to determine the specific combination of retinal neurons they would 
produce, including 52 cells that produce two photoreceptor neurons, 10 cells that produce a photoreceptor 
and bipolar neuron, and 16 cells that produced a photoreceptor neuron and an amacrine cell. For the 
terminal versus self-renewing question, 99% accuracy was achieved in prediction using a spectral nearest 
neighbor classifier. For the two photoreceptor versus other combination question, 87% accuracy was 
achieved using a spectral decision tree classifier. Finally, for the specific combination of retinal neurons 
83% accuracy was achieved also using a spectral decision tree classifier. 

Classification using the newly proposed NCD (IIII.3I ) is much more straightforward and leads to 
significantly better results. Given multisets A and B, each consisting of cells having a given fate, and 
a cell x with unknown fate, we proceed as follows. We assign x to whichever multiset has its distance 
(more picturesque "diameter") increased the least with the addition of x. In other words, if 

NCD(Ax) - NCD(A) < NCD(Bx) - NCD(B), (VII.l) 

we assign x to multiset A, else we assign x to multiset B. (The notation Xx is shorthand for the multiset 
X with one occurrence of x added.) For this first dataset, we did not need to evaluate the subset term, 
or second term in the outer maximization in (HII.31 as detailed in the following section. 

The classification accuracy improved considerably using the newly proposed NCD for multiples. For 
the terminal versus self -renewing question, we achieved 100% accuracy in prediction compared to 
99% accuracy for the multiresolution spectral pairwise NCD. For the two photoreceptor versus other 
combination question, we also achieved 100% accuracy compared to 87%. Finally, for the specific 
combination of retinal neurons we achieved 92% accuracy compared to 83% with the previous method. 
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B. Synthetic Time Sequence Data 

In addition to the retinal progenitor cell data, we applied the new NCD for multiples to analyzing 
synthetically generated time sequence data intended for the characterization of new machine learning 
algorithms |[T0l . The testing data here consists of 300 numerical time sequences, each containing 60 time 
points and belonging to one of six classes. This data is classified with 88% accuracy using a nearest 
neighbor Euclidian distance classifier. (We note that other methods such as dynamic time warping (DTW) 
have achieved considerably better results.) 

In applying the NCD to this data, we first measure the separation between classes or the margin. Given 
multisets A and B, each corresponding to a class in the testing data, we measure the separation between 
the two classes as 

NCD(AB) - NCD(A) - NCD(B). (VII.2) 

This follows directly from the relevant Venn diagram. Our approach is to ensure that the separation 
between classes is larger than any separation between subsets of the same class. If the separation between 
classes is larger than the separation within any class, we can ignore the subset component of (IIII.3I ). 
Verifying this requires us to evaluate the NCD over the powerset of each class and is this is not feasible. 
Instead, we have developed an approximate approach based on an expectation maximization algorithm 
to partition the classes such that there exist no subsets of a class separated by a margin larger than the 
minimum separation between classes. 

Our expectation maximization algorithm attempts to partition the classes into maximally separated 
subsets as measured by (IVII.2I ). This algorithm, that we have termed K-Lists, is modeled after the K- 
means algorithm. Although it is suitable for general clustering, here we use it to partition the data 
into two maximally separated subsets. The algorithm is detailed in Figure [T] There is one important 
difference between proposed K-Lists algorithm and the K-Means algorithm. Because we are not using 
the centroid of a cluster as a representative value as in K-Means, but rather the subset itself via the 
NCD for multiples, we only allow a single element to change subsets at every iteration. This prevents 
thrashing where groups of elements chase each other back and forth between the two subsets. This step 
is computationally demanding, but it is an inherently parallel computation. 

For the retinal progenitor cell data described in the previous section, the K-Lists partitioning algorithm 
was not able to find any subsets for any of the three questions that had a larger separation as measured by 
dVII.2b compared to the separation between the classes. For the synthetic data, the partitioning algorithm 
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1) (Initialize) Pick two elements (seeds) of X at random, assigning one element to each A and B. 
For each remaining element x, assign x to the closer one of A or B using pairwise NCD to the 
random seeds 

2) For each element x, compute the distance from x to class A and B using (IVII.1I ) and assign to 
whichever class achieves the smaller distance. 

3) Choose the single element that wants to change subsets, e.g. from A to B or vice versa and whose 
change maximizes NCD(AB) - NCD(A) - NCD(B) and swap that element from A to B or 
vice versa. 

4) Repeat steps 2 and 3 until no more elements want to change subsets or until we exceed e.g. 100 
iterations. 

Repeat the whole process some fixed number of times (here we use 5) for each X and choose the 
subsets that achieve the maximum of e(AB) — e(A) — e{B). If that value exceeds the minimum inter- 
class separation then divide X into A and B and repeat the process for A and B. If the value does 
not exceed the minimum inter-class separation of our training data, then accept X as approximately 
monotonic and go on to the next class. 

Fig. 1. Partitioning algorithm for identifying maximally separated subsets For each class (multiset) X, partition X into two 
subsets A and B such that NCD(AB) - NCD(A) - NCD(B) is a maximum 



was consistently able to find subsets with separation larger than the between class separation. For the 
synthetic data, the partitioning was run repeatedly and the best partitioning selected using cross validation. 
The accuracy of this preliminary classification was 88% correct. This is equivalent to the nearest neighbor 
Euclidean-distance classifier, and less accurate than the classifiers that used DTW. 

C. Data, Software, Machines 

All of the software and the time sequence data for the RPC fate outcome problem can be downloaded 



from http://bioimage.coe.drexel.edu/ncdm The data for the synthetic time sequence can be downloaded 
from ifTOl . The software is implemented in C and uses MPI for parallelization. Data import is handled by 
a MATLAB script that is also provided. The software has been run on a very small cluster, consisting of 
150 (hyperthreaded) Xeon and 17 cores running at 2.9 Ghz. The RPC classification runs in approximately 
20 minutes for each question, while the partitioning and classification of the synthetic data takes a few 
hours. 
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